In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.resetwarnings = True
In [2]:
df = pd.read_csv('health care diabetes.csv')
df.shape
Out[2]:
(768, 9)

Initial data exploration

1.1 Perform descriptive analysis. Understand the variables and their corresponding values. On the columns below, a value of zero does not make sense and thus indicates missing value:

In [3]:
df.describe()
Out[3]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Minimum values for Glucose, Blood Pressure, Skin Thickness, Insulin, BMI are 0. This are not correct values and need to be changes

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

There are no empty/null values

In [5]:
df.dtypes
Out[5]:
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

All the values are numeric. Outcome is the Target value and is categorical

In [6]:
df['Outcome'].value_counts()
Out[6]:
0    500
1    268
Name: Outcome, dtype: int64

Clearly an unbalanced dataset. No of patients with Diabetes is almost half of non daibetic

Data preparation and Visualizations

1.2. Visually explore these variables using histograms. Treat the missing values accordingly.

In [7]:
# Replacing incorrect values for Glucose, Blood Pressure, Skin Thickness, Insulin, BMI with correct values 
# Statistically Calculated
# Replacing all the 0 values for above mentioned column with medians for each of them
median_glucose = np.median(np.array(df.loc[df['Glucose'] != 0.0]))
median_bp = np.median(np.array(df.loc[df['BloodPressure'] != 0.0]))
median_skinThickness = np.median(np.array(df.loc[df['SkinThickness'] != 0.0]))
median_insulin = np.median(np.array(df.loc[df['Insulin'] != 0.0]))
median_bmi = np.median(np.array(df.loc[df['BMI'] != 0.0]))

df.loc[df['Glucose']== 0, 'Glucose'] = median_glucose
df.loc[df['BloodPressure']== 0, 'BloodPressure'] = median_bp
df.loc[df['SkinThickness']== 0, 'SkinThickness'] = median_skinThickness
df.loc[df['Insulin']== 0, 'Insulin'] = median_bmi
df.loc[df['BMI']== 0, 'BMI'] = median_bmi
In [8]:
df.describe()
Out[8]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 121.054688 70.244792 28.812500 91.973958 32.350651 0.471876 33.240885 0.348958
std 3.369578 31.422812 15.626757 8.806703 107.200561 6.932090 0.331329 11.760232 0.476951
min 0.000000 24.600000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 25.000000 25.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 28.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [9]:
# Understanding the distribution of each feature through histogram
import math
import plotly.express as px
for i in df.columns[0:-1]:
    counts, bins = np.histogram(df[i], bins=range(math.floor(min(df[i])), math.ceil(max(df[i])), math.ceil(max(df[i])/10)))
    bins = 0.5 * (bins[:-1] + bins[1:])
    fig = px.bar(x=bins, y=counts, labels={'x': i , 'y':'Count'})
    fig.show()
Pregnancies : Majority are between 0-2 and the count gradually goes down with increase in number of pregnancy Glucose : 581 females have normal level(71-140), 11 of them with low level(31-70) and remaining 147 with high level Blood Pressure : The numbers are takes as Diastolic figures for BP. Low BP (21-60) = 193, Normal BP(61-80) = 515 and Hign BP(81-110)= 59 Skin Thickness : The normal range is (21-30) = 578 and thin (<=20) = 59 and thick (>=31) = 130 Insulin : Normal range (30-200) = 663 , high level(>200) = 97 BMI : Majority of the patients lie in the in range of 21-40 and few above 40 and few below 20 Diabetic Pedigree function : 717 patient are between (0-1) and remaining above 1 Age : Number of patient are decreasing with increase in Age. Maximum lie in between (20 and 30)
In [10]:
# Generating Distribution plot for each feature
import plotly.figure_factory as ff
for i in df.columns[0:-1]:
    fig = ff.create_distplot([df[i]], [i])
    fig.show()

2.1. Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.

In [11]:
a, b = df['Outcome'].value_counts()
fig = px.pie(values = [a,b],labels=['Target 0', 'Target 1'], names=['Target 0', 'Target 1']
             ,title='Distribution of Outcome')
fig.show()
Unbalanced dataset with 65.1% Outcome = 0 and 34.9% Outcome = 1

2.2. Create scatter charts between the pair of variables to understand the relationships. Describe your findings.

In [12]:
# Finding all combinations of columns in df
import plotly.express as px
from itertools import combinations
comb = [comb for comb in combinations(list(df.columns), 2)]

# Ploting graphs for each combination
for i in comb:
    p = str(str(i[0])+' vs '+ str(i[1]))
    fig = px.scatter(df, x=df[i[0]], y=df[i[1]])
    fig.update_layout(title=p,xaxis_title=i[0],yaxis_title=i[1])
    fig.show()
    

2.3. Perform correlation analysis. Visually explore it using a heat map.

In [13]:
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
df.corr().iplot(kind='heatmap',colorscale="Blues", title="Feature Correlation Matrix")
Age and Pregnancy are correlated with coefficeint of 0.5 Gluscose and Daibetes (Outcome) are also correlated with coefficeint 0.47 BMI and SkinThickness are correlated with 0.54 coefficeint Rest of the feature are mildly correlated with each other and are not very significant

3.1. Devise strategies for model building. It is important to decide the right validation framework. Express your thought process.

Strategy going forward :

1. This problem comes under Classification as the Outcome(Target value) is
    Boolean(0 and 1)
2. Algorithms to be looked into will be Decision Trees, Random Forests, SVMs 
    and Logistic Regression.
3. Since dataset is small Cross validation techniques to be used and dataset will be 
    split into Train and Test data in the ratio (80:20)
4. Dataset is unbalanced, hence validation mertrics wont be accuracy. 
    Performance of the model will be based on Recall and Precision.
5. Base model for performace check will be KNN algorithm.
6. Best performing algorithm from 4 will make the final model.

Building Models

In [14]:
# data set is skewed and hence the model will not be accurate with the same data
# Skew has to be removed and then modelling .
In [15]:
# check skew and remove
def check_skew(df,all_cols):
    make_transform = pd.DataFrame(columns=["Column","Skew","Kurtosis"])
    index = 0
    avg_skew = 0
    for c in all_cols:
        
        skew = df.loc[:,[c]].skew().item()
        kurto = df.loc[:,[c]].kurtosis().item()
        avg_skew = avg_skew+skew
        make_transform = make_transform.append({"Column":c,"Skew":skew,"Kurtosis":kurto},ignore_index=True)
    return make_transform,avg_skew/len(all_cols)

make_transform,avg_skew = check_skew(df, df.columns[0:-1])
print(make_transform)
print("Average Skew :",avg_skew)
                     Column      Skew  Kurtosis
0               Pregnancies  0.901674  0.159220
1                   Glucose  0.351823  0.035740
2             BloodPressure -0.795080  1.800710
3             SkinThickness  0.933596  5.491026
4                   Insulin  2.616352  9.256144
5                       BMI  0.612079  0.847757
6  DiabetesPedigreeFunction  1.919911  5.594954
7                       Age  1.129597  0.643159
Average Skew : 0.9587438504666445
c:\program files\python36\lib\site-packages\ipykernel_launcher.py:8: FutureWarning:

`item` has been deprecated and will be removed in a future version

c:\program files\python36\lib\site-packages\ipykernel_launcher.py:9: FutureWarning:

`item` has been deprecated and will be removed in a future version

In [38]:
# Average Skew is high and will impact the model overall performance.
# It should be reduced and brought between -0.5 and +0.5
In [17]:
import math
import scipy.stats as ss

def remove_skew(DF,include = None, threshold = 0.2):
    
    transform_master = pd.DataFrame(columns=["column","delta","lambda","skew_old","skew_new"])
    #Get list of column names that should be processed based on input parameters
    if include is None and exclude is None:
        colnames = DF.columns.values
    elif include is not None:
        colnames = include
    else:
        print('No columns to process!')
    
    #Helper function that checks if all values are positive
    
    def make_positive(series):
        minimum = np.amin(series)
        original = series[0]
        #If minimum is negative, offset all values by a constant to move all values to minimum positive 
        if minimum <= 0:
            series = series + abs(minimum) + 0.001
        delta = series[0]-original
        return series,delta
    
    #Go through desired columns in DataFrame
    
    for col in colnames:
        #Get column skewness
        skew = DF[col].skew()  #
        # If skewness is larger than threshold and positively skewed; If yes, apply appropriate transformation
        # Prefered transformation - CoxBox Transformation
        if abs(skew) > threshold and skew > 0:
            skewType = 'positive'
            #Make sure all values are positive
            DF[col],delta = make_positive(DF[col])    #
            DF[col],fitted_lambda = ss.boxcox(DF[col])   #
            skew_new = DF[col].skew()
        elif abs(skew) > threshold and skew < 0:
            skewType = 'negative'
            #Make sure all values are positive
            DF[col],delta = make_positive(DF[col])
            
            DF[col],fitted_lambda = ss.boxcox(DF[col]) #
            skew_new = DF[col].skew()
        #print("appending...",col,delta,fitted_lambda,skew,skew_new)
        
        transform_master = transform_master.append({"column":col,"delta":delta,"lambda":fitted_lambda,
                                                    "skew_old":skew,"skew_new":skew_new}
                                                        ,ignore_index=True)
    #print(transform_master)
    return DF,transform_master


df,transform_master = remove_skew(df,df.columns[0:-1])
transform_master = transform_master.set_index("column")
print(transform_master)
print('Average Skew : ',np.mean(np.array(transform_master['skew_new'])))
                          delta    lambda  skew_old  skew_new
column                                                       
Pregnancies               0.001  0.380935  0.901674 -0.525767
Glucose                   0.000  0.681995  0.351823  0.049752
BloodPressure             0.000  1.767709 -0.795080  0.084534
SkinThickness             0.000  0.575017  0.933596  0.091597
Insulin                   0.000 -0.537312  2.616352  0.282720
BMI                       0.000  0.038822  0.612079 -0.000475
DiabetesPedigreeFunction  0.000 -0.073108  1.919911  0.007927
Age                       0.000 -1.094423  1.129597  0.146919
Average Skew :  0.017150961875347315
In [18]:
df.head()
Out[18]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 2.570075 42.825861 1085.389951 11.694273 1.531017 3.765572 -0.474866 0.901093 1
1 0.001000 28.877733 930.569880 10.317463 1.531017 3.499017 -1.088080 0.892411 0
2 3.171682 49.725401 881.273359 10.076623 1.531017 3.348952 -0.403329 0.893138 1
3 0.001000 29.844451 930.569880 8.812944 1.699088 3.561393 -1.912132 0.881083 0
4 -2.436169 40.553306 383.640759 11.694273 1.742515 4.052353 0.803134 0.893820 1
In [19]:
# Since there is a huge imbalance in the target, Using SMOTE to balance the target variable
from imblearn.over_sampling import SMOTE
features = df.drop(['Outcome'],axis=1)
target = df['Outcome']

smote = SMOTE()
labels, outcome = smote.fit_resample(features, target)

bal_df = pd.DataFrame(labels, columns=features.columns)
bal_df['Outcome'] = outcome

bal_df['Outcome'].value_counts().plot(kind='bar', title='Outcome');
Using TensorFlow backend.
In [20]:
from sklearn.model_selection import train_test_split as tts
features = bal_df.drop(['Outcome'],axis=1)
target = bal_df['Outcome']
X_train, X_test, Y_train, Y_test = tts(features, target, test_size=0.2, stratify=target,random_state=71)
X_train.shape, Y_test.shape
Out[20]:
((800, 8), (200,))
In [21]:
# Model 1 - Decision Trees
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,classification_report,precision_score,recall_score,f1_score,roc_auc_score

dt = DecisionTreeClassifier(max_depth= 5, random_state=71)
cv_scores = cross_val_score(dt, X_train, Y_train, cv=10)
print("Average training scores for decision Trees :", np.mean(cv_scores))
dt.fit(X_train,Y_train)
# Checking Training scores
print(classification_report(Y_train,dt.predict(X_train)))
print("ROC AUC score",roc_auc_score(Y_train,dt.predict(X_train)))
# Checking Test scores
print(classification_report(Y_test,dt.predict(X_test)))
print("ROC AUC score",roc_auc_score(Y_test,dt.predict(X_test)))
Average training scores for decision Trees : 0.74
              precision    recall  f1-score   support

           0       0.81      0.86      0.84       400
           1       0.85      0.80      0.82       400

    accuracy                           0.83       800
   macro avg       0.83      0.83      0.83       800
weighted avg       0.83      0.83      0.83       800

ROC AUC score 0.83
              precision    recall  f1-score   support

           0       0.75      0.83      0.79       100
           1       0.81      0.72      0.76       100

    accuracy                           0.78       200
   macro avg       0.78      0.77      0.77       200
weighted avg       0.78      0.78      0.77       200

ROC AUC score 0.775
In [22]:
# Model 2 - Support Vector Machine - SVM
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
cv_scores = cross_val_score(svm, X_train, Y_train, cv=10)
print("Average training scores for SVM :", np.mean(cv_scores))
svm.fit(X_train,Y_train)
# Checking Training scores
print(classification_report(Y_train,svm.predict(X_train)))
print("ROC AUC score",roc_auc_score(Y_train,svm.predict(X_train)))
# Checking Test scores
print(classification_report(Y_test,svm.predict(X_test)))
print("ROC AUC score",roc_auc_score(Y_test,svm.predict(X_test)))
Average training scores for SVM : 0.78625
              precision    recall  f1-score   support

           0       0.88      0.77      0.82       400
           1       0.79      0.90      0.84       400

    accuracy                           0.83       800
   macro avg       0.84      0.83      0.83       800
weighted avg       0.84      0.83      0.83       800

ROC AUC score 0.8300000000000001
              precision    recall  f1-score   support

           0       0.78      0.78      0.78       100
           1       0.78      0.78      0.78       100

    accuracy                           0.78       200
   macro avg       0.78      0.78      0.78       200
weighted avg       0.78      0.78      0.78       200

ROC AUC score 0.78
In [23]:
# Model 3 - Random Forest 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=5, random_state=71)
cv_scores = cross_val_score(rf, X_train, Y_train, cv=10)
print("Average training scores for Random forest :", np.mean(cv_scores))
rf.fit(X_train,Y_train)
# Checking Training scores
print(classification_report(Y_train,rf.predict(X_train)))
print("ROC AUC score",roc_auc_score(Y_train,rf.predict(X_train)))
# Checking Test scores
print(classification_report(Y_test,rf.predict(X_test)))
print("ROC AUC score",roc_auc_score(Y_test, rf.predict(X_test)))
Average training scores for Random forest : 0.78
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       400
           1       0.80      0.91      0.85       400

    accuracy                           0.84       800
   macro avg       0.85      0.84      0.84       800
weighted avg       0.85      0.84      0.84       800

ROC AUC score 0.84375
              precision    recall  f1-score   support

           0       0.80      0.78      0.79       100
           1       0.78      0.80      0.79       100

    accuracy                           0.79       200
   macro avg       0.79      0.79      0.79       200
weighted avg       0.79      0.79      0.79       200

ROC AUC score 0.79
In [24]:
# Model 4 - Logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=10000,random_state=71)
cv_scores = cross_val_score(lr, X_train, Y_train, cv=10)
print("Average training scores for Logistic Regression :", np.mean(cv_scores))
lr.fit(X_train,Y_train)
# Checking Training scores
print(classification_report(Y_train,lr.predict(X_train)))
print("ROC AUC score",roc_auc_score(Y_train,lr.predict(X_train)))
# Checking Test scores
print(classification_report(Y_test,lr.predict(X_test)))
print("ROC AUC score",roc_auc_score(Y_test,lr.predict(X_test)))
Average training scores for Logistic Regression : 0.75
              precision    recall  f1-score   support

           0       0.75      0.73      0.74       400
           1       0.74      0.76      0.75       400

    accuracy                           0.74       800
   macro avg       0.74      0.74      0.74       800
weighted avg       0.74      0.74      0.74       800

ROC AUC score 0.7425
              precision    recall  f1-score   support

           0       0.76      0.78      0.77       100
           1       0.77      0.75      0.76       100

    accuracy                           0.77       200
   macro avg       0.77      0.77      0.76       200
weighted avg       0.77      0.77      0.76       200

ROC AUC score 0.765
In [25]:
# find best model, which is more important, recall or precision value- danger case
# build knn model
# find best model find features- add code to exisiting
# fine tune best model - hyperparamter tuning
# build final model
In [26]:
# Model 5 - Base model KNN 
from sklearn.neighbors import KNeighborsClassifier
cv_scores = []
for k in range(1,100,1):
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(knn,X_train,Y_train,cv = 10,scoring ="accuracy")
    cv_scores.append(scores.mean())
print("Best Scores - index : {} score :{}".format(cv_scores.index(max(cv_scores)),max(cv_scores)))

k = cv_scores.index(max(cv_scores))
knn = KNeighborsClassifier(n_neighbors= k+1)
cv_scores = cross_val_score(knn, X_train, Y_train, cv=10)
print("Average training scores for Logistic Regression :", np.mean(cv_scores))
knn.fit(X_train,Y_train)
# Checking Training scores
print(classification_report(Y_train,knn.predict(X_train)))
print("ROC AUC score",roc_auc_score(Y_train,knn.predict(X_train)))
# Checking Test scores
print(classification_report(Y_test,knn.predict(X_test)))
print("ROC AUC score",roc_auc_score(Y_test,knn.predict(X_test)))
Best Scores - index : 0 score :0.7525000000000001
Average training scores for Logistic Regression : 0.7525000000000001
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       400
           1       1.00      1.00      1.00       400

    accuracy                           1.00       800
   macro avg       1.00      1.00      1.00       800
weighted avg       1.00      1.00      1.00       800

ROC AUC score 1.0
              precision    recall  f1-score   support

           0       0.82      0.69      0.75       100
           1       0.73      0.85      0.79       100

    accuracy                           0.77       200
   macro avg       0.78      0.77      0.77       200
weighted avg       0.78      0.77      0.77       200

ROC AUC score 0.77
  1. For the use case, the important metrics to be used is Recall(Diabetic) because all the diabetic patients have to be diagnosed and failing to do this is a major fallback.
  2. Precision cannot be used as metric because even with high precision score, if some diabetic patients are missed out then its a model drawback.
  3. Capturing all Diabetic people is the priority task here.
  4. As a secondary metric, ROC-AUC curve and F1 score will be used.
In [ ]:
 
a. Based on Recall, AUC-ROC score and F1 test scores the model performance is as : Logistic Regression < KNN < SVM < Decision Tree < Random Forest c. Final model for the use case - Random Forest
In [27]:
df.columns
Out[27]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
In [28]:
# Random Forest is the selected final model
# The current model is overfitted and has high bais -  Reduce the bias using Bagging 
In [29]:
# Comparing Standard Random Forest with Random Forest Bagged over 1000 iterations
from sklearn.ensemble import BaggingClassifier
comp_df = pd.DataFrame()
train_acc, train_pres, train_recall, train_f1, train_roc = [], [], [], [],[]
test_acc, test_pres, test_recall, test_f1, test_roc= [], [], [], [], []
b_train_acc, b_train_pres, b_train_recall, b_train_f1, b_train_roc = [], [], [], [],[]
b_test_acc, b_test_pres, b_test_recall, b_test_f1, b_test_roc= [], [], [], [], []

for i in range(100):
    rf = RandomForestClassifier(max_depth=5, random_state=i)
    rf.fit(X_train, Y_train)
    brf = BaggingClassifier(base_estimator=rf, random_state=i)
    brf.fit(X_train, Y_train)
    
    comp_df.at[i,'train_acc'], comp_df.at[i,'test_acc'] = accuracy_score(Y_train, rf.predict(X_train)), accuracy_score(Y_test, rf.predict(X_test))
    comp_df.at[i,'b_train_acc'], comp_df.at[i,'b_test_acc'] = accuracy_score(Y_train, brf.predict(X_train)), accuracy_score(Y_test, brf.predict(X_test))
    comp_df.at[i,'train_pres'], comp_df.at[i,'test_pres'] = precision_score(Y_train, rf.predict(X_train)), precision_score(Y_test, rf.predict(X_test))
    comp_df.at[i,'b_train_pres'], comp_df.at[i,'b_test_pres'] = precision_score(Y_train, brf.predict(X_train)), precision_score(Y_test, brf.predict(X_test))
    comp_df.at[i,'train_recall'], comp_df.at[i,'test_recall'] = recall_score(Y_train, rf.predict(X_train)), recall_score(Y_test, rf.predict(X_test))
    comp_df.at[i,'b_train_recall'], comp_df.at[i,'b_test_recall'] = recall_score(Y_train, brf.predict(X_train)), recall_score(Y_test, brf.predict(X_test))
    comp_df.at[i,'train_roc'], comp_df.at[i,'test_roc'] = roc_auc_score(Y_train, rf.predict(X_train)), roc_auc_score(Y_test, rf.predict(X_test))
    comp_df.at[i,'b_train_roc'], comp_df.at[i,'b_test_roc'] = roc_auc_score(Y_train, brf.predict(X_train)), roc_auc_score(Y_test, brf.predict(X_test))
    comp_df.at[i,'train_f1'], comp_df.at[i,'test_f1'] = f1_score(Y_train, rf.predict(X_train)), f1_score(Y_test, rf.predict(X_test))
    comp_df.at[i,'b_train_f1'], comp_df.at[i,'b_test_f1'] = f1_score(Y_train, brf.predict(X_train)), f1_score(Y_test, brf.predict(X_test))

print(comp_df.head())
   train_acc  test_acc  b_train_acc  b_test_acc  train_pres  test_pres  \
0    0.85125     0.800      0.84625       0.790    0.817156   0.794118   
1    0.84500     0.780      0.84500       0.785    0.806667   0.769231   
2    0.85500     0.780      0.83750       0.800    0.815556   0.774510   
3    0.85250     0.785      0.84750       0.805    0.814732   0.771429   
4    0.84875     0.795      0.84000       0.790    0.810690   0.780952   

   b_train_pres  b_test_pres  train_recall  test_recall  b_train_recall  \
0      0.815490     0.784314        0.9050         0.81          0.8950   
1      0.805310     0.771429        0.9075         0.80          0.9100   
2      0.804054     0.794118        0.9175         0.79          0.8925   
3      0.808889     0.790476        0.9125         0.81          0.9100   
4      0.804933     0.778846        0.9100         0.82          0.8975   

   b_test_recall  train_roc  test_roc  b_train_roc  b_test_roc  train_f1  \
0           0.80    0.85125     0.800      0.84625       0.790  0.858837   
1           0.81    0.84500     0.780      0.84500       0.785  0.854118   
2           0.81    0.85500     0.780      0.83750       0.800  0.863529   
3           0.83    0.85250     0.785      0.84750       0.805  0.860849   
4           0.81    0.84875     0.795      0.84000       0.790  0.857479   

    test_f1  b_train_f1  b_test_f1  
0  0.801980    0.853397   0.792079  
1  0.784314    0.854460   0.790244  
2  0.782178    0.845972   0.801980  
3  0.790244    0.856471   0.809756  
4  0.800000    0.848700   0.794118  
In [30]:
print("Test accuracy\nStandard Random Forest: {}\tBagged Random Forest: {} ".format(np.mean(comp_df['test_acc']),np.mean(comp_df['b_test_acc'])))
print("Test Recall\nStandard Random Forest: {}\tBagged Random Forest: {} ".format(np.mean(comp_df['test_recall']),np.mean(comp_df['b_test_recall'])))
print("Test F1\nStandard Random Forest: {}\tBagged Random Forest: {} ".format(np.mean(comp_df['test_f1']),np.mean(comp_df['b_test_f1'])))
print("Test ROC-AUC\nStandard Random Forest: {}\tBagged Random Forest: {} ".format(np.mean(comp_df['test_roc']),np.mean(comp_df['b_test_roc'])))
Test accuracy
Standard Random Forest: 0.7872	Bagged Random Forest: 0.7925 
Test Recall
Standard Random Forest: 0.805	Bagged Random Forest: 0.8087000000000002 
Test F1
Standard Random Forest: 0.7909002623585611	Bagged Random Forest: 0.7957997221925177 
Test ROC-AUC
Standard Random Forest: 0.7872	Bagged Random Forest: 0.7925 
There isnt much difference between Standard Random Forest and Bagged Random Forest, but since the scores are overfitted, Chosing Bagged Random Forest (May perform better after Hyper parameter Tuning)
In [31]:
from pprint import pprint
brf = RandomForestClassifier()
brf = BaggingClassifier(base_estimator=brf)
print("All paramters available for tuning")
pprint(brf.get_params())
All paramters available for tuning
{'base_estimator': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False),
 'base_estimator__bootstrap': True,
 'base_estimator__ccp_alpha': 0.0,
 'base_estimator__class_weight': None,
 'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': None,
 'base_estimator__max_features': 'auto',
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__max_samples': None,
 'base_estimator__min_impurity_decrease': 0.0,
 'base_estimator__min_impurity_split': None,
 'base_estimator__min_samples_leaf': 1,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__n_estimators': 100,
 'base_estimator__n_jobs': None,
 'base_estimator__oob_score': False,
 'base_estimator__random_state': None,
 'base_estimator__verbose': 0,
 'base_estimator__warm_start': False,
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
In [32]:
# Choosing important features for tuning
base_estimator__criterion = ['gini','entropy']
base_estimator__max_depth = [None,5,7,10]
base_estimator__min_samples_split = [2,5,10]
base_estimator__min_samples_leaf = [1,3,5]
base_estimator__max_features = [1.0,'sqrt','auto','log2',0.8]
base_estimator__n_estimators = [50,100,150]

grid = {'base_estimator__criterion': base_estimator__criterion,
               'base_estimator__n_estimators':base_estimator__n_estimators,
               'base_estimator__max_depth': base_estimator__max_depth,
               'base_estimator__min_samples_split': base_estimator__min_samples_split,
               'base_estimator__min_samples_leaf': base_estimator__min_samples_leaf,
               'base_estimator__max_features': base_estimator__max_features
       }
print(grid)
{'base_estimator__criterion': ['gini', 'entropy'], 'base_estimator__n_estimators': [50, 100, 150], 'base_estimator__max_depth': [None, 5, 7, 10], 'base_estimator__min_samples_split': [2, 5, 10], 'base_estimator__min_samples_leaf': [1, 3, 5], 'base_estimator__max_features': [1.0, 'sqrt', 'auto', 'log2', 0.8]}
In [33]:
# Performing Grid Search 
from sklearn.model_selection import GridSearchCV
grid_brf = GridSearchCV(brf, grid, cv = 3, verbose = 2, n_jobs = -1)
grid_brf.fit(X_train, Y_train)
Fitting 3 folds for each of 1080 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   50.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 12.7min
[Parallel(n_jobs=-1)]: Done 1005 tasks      | elapsed: 19.7min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed: 27.8min
[Parallel(n_jobs=-1)]: Done 1977 tasks      | elapsed: 38.3min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed: 50.3min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 64.9min finished
Out[33]:
GridSearchCV(cv=3, error_score=nan,
             estimator=BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True,
                                                                               ccp_alpha=0.0,
                                                                               class_weight=None,
                                                                               criterion='gini',
                                                                               max_depth=None,
                                                                               max_features='auto',
                                                                               max_leaf_nodes=None,
                                                                               max_samples=None,
                                                                               min_impurity_decrease=0.0,
                                                                               min_impurity_split=None,
                                                                               min_samples_leaf=1,
                                                                               min_samples_split=2,
                                                                               min_weight_fraction_leaf=...
             param_grid={'base_estimator__criterion': ['gini', 'entropy'],
                         'base_estimator__max_depth': [None, 5, 7, 10],
                         'base_estimator__max_features': [1.0, 'sqrt', 'auto',
                                                          'log2', 0.8],
                         'base_estimator__min_samples_leaf': [1, 3, 5],
                         'base_estimator__min_samples_split': [2, 5, 10],
                         'base_estimator__n_estimators': [50, 100, 150]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
In [34]:
best_grid = grid_brf.best_estimator_
print("Best parameters : ",best_grid)
Best parameters :  BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True,
                                                        ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='entropy',
                                                        max_depth=10,
                                                        max_features='log2',
                                                        max_leaf_nodes=None,
                                                        max_samples=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        n_estimators=50,
                                                        n_jobs=None,
                                                        oob_score=False,
                                                        random_state=None,
                                                        verbose=0,
                                                        warm_start=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)
In [35]:
final_brf = BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True,
                                                        ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='entropy',
                                                        max_depth=7,
                                                        max_features=1.0,
                                                        max_leaf_nodes=None,
                                                        max_samples=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=3,
                                                        min_samples_split=5,
                                                        min_weight_fraction_leaf=0.0,
                                                        n_estimators=150,
                                                        n_jobs=None,
                                                        oob_score=False,
                                                        random_state=None,
                                                        verbose=0,
                                                        warm_start=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)
In [36]:
import pickle
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=20, shuffle=True, random_state=1)
# enumerate the splits and summarize the distributions
all_fold = pd.DataFrame()
i = 0
for train_ix, test_ix in kfold.split(features, target):
    X_train, X_test = features.iloc[train_ix], features.iloc[test_ix]
    Y_train, Y_test = target.iloc[train_ix], target.iloc[test_ix]
    final_brf.fit(X_train,Y_train)
    all_fold.at[i,'Model'] = 'model'+ str(i)
    all_fold.at[i,'train ac'] = accuracy_score(Y_train, final_brf.predict(X_train))
    all_fold.at[i,'train pre'] = precision_score(Y_train, final_brf.predict(X_train))
    all_fold.at[i,'train rec'] = recall_score(Y_train, final_brf.predict(X_train))
    all_fold.at[i,'train f1'] = f1_score(Y_train, final_brf.predict(X_train))
    all_fold.at[i,'train roc'] = roc_auc_score(Y_train, final_brf.predict(X_train))
    all_fold.at[i,'test ac'] = accuracy_score(Y_test,final_brf.predict(X_test))
    all_fold.at[i,'test pre'] = precision_score(Y_test,final_brf.predict(X_test))
    all_fold.at[i,'test rec'] = recall_score(Y_test,final_brf.predict(X_test))
    all_fold.at[i,'test f1'] = f1_score(Y_test,final_brf.predict(X_test))
    all_fold.at[i,'test roc'] = roc_auc_score(Y_test,final_brf.predict(X_test))
    f = 'models/model' + str(i) + '.pkl'
    print(f)
    pickle.dump(final_brf, open(f, 'wb'))
    i = i+1
all_fold
models/model0.pkl
models/model1.pkl
models/model2.pkl
models/model3.pkl
models/model4.pkl
models/model5.pkl
models/model6.pkl
models/model7.pkl
models/model8.pkl
models/model9.pkl
models/model10.pkl
models/model11.pkl
models/model12.pkl
models/model13.pkl
models/model14.pkl
models/model15.pkl
models/model16.pkl
models/model17.pkl
models/model18.pkl
models/model19.pkl
Out[36]:
Model train ac train pre train rec train f1 train roc test ac test pre test rec test f1 test roc
0 model0 0.874737 0.840996 0.924211 0.880642 0.874737 0.84 0.814815 0.88 0.846154 0.84
1 model1 0.875789 0.842610 0.924211 0.881526 0.875789 0.84 0.814815 0.88 0.846154 0.84
2 model2 0.878947 0.848837 0.922105 0.883956 0.878947 0.70 0.656250 0.84 0.736842 0.70
3 model3 0.881053 0.852140 0.922105 0.885743 0.881053 0.70 0.692308 0.72 0.705882 0.70
4 model4 0.870526 0.838462 0.917895 0.876382 0.870526 0.90 0.884615 0.92 0.901961 0.90
5 model5 0.883158 0.848659 0.932632 0.888666 0.883158 0.84 0.840000 0.84 0.840000 0.84
6 model6 0.873684 0.842004 0.920000 0.879276 0.873684 0.86 0.821429 0.92 0.867925 0.86
7 model7 0.864211 0.831418 0.913684 0.870612 0.864211 0.70 0.727273 0.64 0.680851 0.70
8 model8 0.870526 0.845098 0.907368 0.875127 0.870526 0.88 0.851852 0.92 0.884615 0.88
9 model9 0.876842 0.845560 0.922105 0.882175 0.876842 0.84 0.793103 0.92 0.851852 0.84
10 model10 0.880000 0.851852 0.920000 0.884615 0.880000 0.86 0.846154 0.88 0.862745 0.86
11 model11 0.880000 0.842505 0.934737 0.886228 0.880000 0.80 0.777778 0.84 0.807692 0.80
12 model12 0.872632 0.840385 0.920000 0.878392 0.872632 0.88 0.806452 1.00 0.892857 0.88
13 model13 0.875789 0.850688 0.911579 0.880081 0.875789 0.76 0.709677 0.88 0.785714 0.76
14 model14 0.875789 0.847953 0.915789 0.880567 0.875789 0.84 0.840000 0.84 0.840000 0.84
15 model15 0.878947 0.848837 0.922105 0.883956 0.878947 0.72 0.720000 0.72 0.720000 0.72
16 model16 0.880000 0.850485 0.922105 0.884848 0.880000 0.70 0.666667 0.80 0.727273 0.70
17 model17 0.877895 0.852652 0.913684 0.882114 0.877895 0.86 0.821429 0.92 0.867925 0.86
18 model18 0.880000 0.849130 0.924211 0.885081 0.880000 0.76 0.709677 0.88 0.785714 0.76
19 model19 0.873684 0.844660 0.915789 0.878788 0.873684 0.74 0.730769 0.76 0.745098 0.74
In [37]:
# Best model according to test scores
m = all_fold.iloc[all_fold['test rec'].idxmax()]['Model']
f = 'models/'+ str(m) +'.pkl'
f = open(f,'rb')
model = pickle.load(f)
print(classification_report(Y_test,model.predict(X_test)))
recall_score(Y_test,model.predict(X_test))
              precision    recall  f1-score   support

           0       0.95      0.76      0.84        25
           1       0.80      0.96      0.87        25

    accuracy                           0.86        50
   macro avg       0.88      0.86      0.86        50
weighted avg       0.88      0.86      0.86        50

Out[37]:
0.96
In [ ]: